Abstract

This paper proves that visual object recognition sys- 
tems using only 2D Euclidean similarity measure- 
ments to compare object views against previously seen 
views can achieve the same recognition performance 
as observers having access to all coordinate information
and able to use arbitrary 3D models internally.
Furthermore, it demonstrates that such systems do 
not require more training views than Bayes-optimal 
3D model-based systems. For building computer vi- 
sion systems, these results imply that using view- 
based or appearance-based techniques with carefully 
constructed combination of evidence mechanisms may 
not be at a disadvantage relative to 3D model-based 
systems. For computational approaches to human vi- 
sion, they show that it is impossible to distinguish 
view-based and 3D model-based techniques for 3D ob- 
ject recognition solely by comparing the performance 
achievable by human and 3D model-based systems. 

1. Introduction 

View-based or appearance-based methods in visual
object recognition represent 3D objects as a collection 
of views for the purposes of recognition. Many differ- 
ent ways in which these views can be used for recogni- 
tion have been proposed: some compare a target view 
against stored views individually, while others allow 
interpolation or combination among multiple views. 
Some approaches use fixed similarity functions and
evidence combination schemes, while others allow for
the learning or adaptation of either or both.

One of the most restrictive forms of view-based 
3D object recognition requires that, in order to per- 
form recognition, each stored view is compared with a 
target view using only a fixed, non-invariant similar- 
ity measure. After performing those similarity mea- 
surements, the observer is then permitted to perform 
some kind of "combination of evidence" on them. In
their papers on human 3D generalization, [3] and [5] refer
to such an observer as an observer using a strong
view-approximation method:

"For example, assume that an object is rep- 
resented by two independent views. The task 
is to decide whether a novel view belongs 
to the object. The strong version of view- 
approximation maintains that in order to 
recognize a novel view, a similarity measure 
is calculated independently between this view 
and each of the two stored views [. . .]. Recog- 
nition is a function of these measurements. 
The simplest function is the nearest neigh- 
bor scheme, where a match is based on the 
closest view in memory. A more sophis- 
ticated scheme is the Bayes classifier that 
combines the evidence over the collection of 
views optimally." [5]

Let us express this notion of "strong view-approximation"
formally. We will call an observer using a strong version
of the view-approximation method^1 a "strongly
two-dimensional observer":

Definition 1 Let $\mathcal{T} = \{T_{\omega,i} : \omega \in \Omega,\ i = 1, \ldots, r_\omega\}$
be a collection of $N$ 2D training views for objects
$\omega \in \Omega = \{\omega_1, \ldots, \omega_N\}$. Let $S(U, V)$ be a real-valued
function of 2D views, the view similarity measure.
Then, a strongly two-dimensional observer is an
observer that classifies an unknown target view $V$ using
a decision procedure $D(V)$ of the form

$$D(V) = f\bigl(S(V, T_{\omega_1,1}),\ S(V, T_{\omega_1,2}),\ \ldots,\ S(V, T_{\omega_N, r_{\omega_N}})\bigr)$$

That is, a strongly two-dimensional observer classi- 
fies objects only based on some functional combina- 
tion f of the individual 2D similarities of the target 
view to each of the training views. 

Note that the observer is permitted to take into
account in his decision similarities to both matching
and non-matching objects^2. For example, in nearest
neighbor methods, we compare similarities from
both matching and non-matching objects in order to
find the view having the highest similarity value (i.e.,
smallest distance).
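
As an illustration of Definition 1, the following minimal sketch implements a nearest neighbor observer whose decision depends on the training views only through their similarity values. The function names, the `(label, view)` layout of the model base, and the choice of negated Euclidean distance as the similarity are assumptions made for this example, not part of the definition.

```python
import math

def neg_euclidean(u, v):
    """Hypothetical similarity: larger means more similar."""
    return -math.dist(u, v)

def nearest_neighbor_f(similarities, labels):
    """Combination function f: label of the stored view with the highest similarity."""
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return labels[best]

def strongly_2d_observer(target, model_base, similarity=neg_euclidean, f=nearest_neighbor_f):
    """model_base is a list of (label, view) pairs.

    The target view is touched only inside `similarity`; the decision is
    a function f of the resulting list of real numbers.
    """
    labels = [label for label, _ in model_base]
    similarities = [similarity(target, view) for _, view in model_base]
    return f(similarities, labels)
```

Any other combination function with the same signature as `nearest_neighbor_f` can be passed as `f`; the observer remains strongly two-dimensional as long as `f` sees only the similarity values.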

Intuitively, it would seem that a strongly two- 
dimensional observer should be limited in his abil- 
ity to perform recognition and should therefore make 
more recognition errors than an observer capable of 
performing full, 3D modeling and recognition. In this 
paper, I demonstrate that that is not the case: given 
the correct Bayesian combination of the individual 
view similarity values, a strongly two-dimensional ob- 
server can achieve the same Bayes-optimal error rate 
as an observer that can access all the coordinate mea- 
surements of the target and training views and uses 
explicit 3D models internally. This is demonstrated 
by showing that an observer can reconstruct the orig- 
inal training and target views well enough from the 
similarity values to be able to perform Bayes-optimal 
3D recognition. Furthermore, I show that the same 
result holds true for model acquisition: a strongly 



sion of view approximation, in which the observer is permitted 
to perform geometric transformations on the target or train- 
ing view. Since we demonstrate in this paper that the strong 
view-approximation method is already sufficient for achieving 
Bayes-optimal 3D performance, we need not consider "more 
flexible" models. 

2 However, while this is perhaps the most plausible definition,
the results of this paper do not depend on it; see Appendix B.



two-dimensional observer can acquire object models 
just as quickly and reliably from view similarity val- 
ues as an observer having full access to views. 

2. Bayes-Optimal 3D Recognition

Assume that we are trying to identify which of a number
of possible objects $\omega$ is represented by some view
$V$ of the object. The Bayes-optimal minimum error
decision procedure $D(V)$ for this problem is to determine
the object with the largest posterior probability
given the image:

$$D(V) = \arg\max_{\omega} P(\omega|V)$$

Via Bayes rule, we can compute $P(\omega|V)$ in terms of
the likelihood $P(V|\omega)$:

$$P(\omega|V) = \frac{P(V|\omega)\,P(\omega)}{P(V)}$$

Since $P(V)$ is independent of the object, our decision
procedure then simply becomes

$$D(V) = \arg\max_{\omega} P(V|\omega)\,P(\omega)$$
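
The decision rule above can be sketched in a few lines. This is a minimal illustration that assumes the likelihoods $P(V|\omega)$ for the observed view and the priors $P(\omega)$ are already available as dictionaries keyed by object label; the function name and the data layout are hypothetical.

```python
def bayes_decide(likelihoods, priors):
    """Return the label w that maximises P(V | w) * P(w).

    P(V) is the same for every label, so it can be dropped from the argmax.
    """
    return max(likelihoods, key=lambda w: likelihoods[w] * priors[w])
```

For example, with likelihoods `{"cup": 0.2, "box": 0.5}` and priors `{"cup": 0.9, "box": 0.1}`, the products are 0.18 and 0.05, so the rule selects `"cup"` even though `"box"` has the larger likelihood.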

Now, let $M_\omega$ be the true 3D model corresponding
to object $\omega$, let $R$ be the 3D object transformation
and imaging transformation, and let $N$ be the noise
and uncertainty introduced by the imaging process.
Then, the target view $V$ is distributed as

$$V \sim R(M_\omega) + N$$

Here, $M_\omega$, $R$, and $N$ are all random variables. In
other words, we can write down a conditional distribution
of $V$ given $R$, $M_\omega$, and $N$. However, $R$ and
$N$ are unobservable. Hence, a Bayes-optimal 3D observer
needs to take into account his prior knowledge
about the distribution of those variables to arrive at
an expression for $P(V|\omega)$:

$$P_M(V|\omega) = P(V|M_\omega) = \int P(V|M_\omega, R, N)\, P(R, N|M_\omega)\, dN\, dR$$



Note that we allow both the distribution of noise $N$
and the distribution of views $R$ to depend on the
model; commonly (though not necessarily correctly),
it is assumed that these are independent, so that
$P(R, N|M_\omega) = P(R)\,P(N)$.

By construction, an observer using $P_M(V|\omega)$ is using
the Bayes-optimal object recognition procedure for
3D objects from 2D views and achieves the Bayes-optimal
error rate on the recognition problem given
$M_\omega$.

In actual practice, an observer almost never knows
the true 3D object model $M_\omega$, but needs to reconstruct
it from a given set of training views
$\mathcal{T}_\omega = \{T_{\omega,1}, \ldots, T_{\omega,r_\omega}\}$. In general, the 3D model cannot
be reconstructed unambiguously from the training
views, due to noise, uncertainty, ambiguity, and/or
occlusions. Therefore, the observer really can only
estimate a distribution $P(M_\omega|\mathcal{T}_\omega)$, and the actual model
also becomes a latent variable:

$$P_{\mathcal{T}}(V|\omega) = P(V|\mathcal{T}_\omega) = \int P(V|M_\omega, R, N)\, P(R, N|M_\omega)\, P(M_\omega|\mathcal{T}_\omega)\, dN\, dR\, dM_\omega \qquad (1)$$

An observer using $P_{\mathcal{T}}(V|\omega)$ is the Bayes-optimal 3D
observer based on a set of 2D training views and
achieves minimum recognition error for the given
prior distributions.

The difference between $P_M(V|\omega)$ and $P_{\mathcal{T}}(V|\omega)$ is
crucial: an observer having a priori knowledge of
the correct 3D structure $M_\omega$ of object $\omega$ can easily
outperform an observer who has to estimate such a
model from training views $\mathcal{T}_\omega$. However, where would
an observer obtain exact knowledge of $M_\omega$? The observer
might have access to information beyond a set
$\mathcal{T}_\omega$ of given training views, such as information derived
from touch or a given CAD (computer-aided
design) blueprint; but then we are comparing the
performance of view-based recognition against the
performance of an observer that has additional information.

The observer might also try to perform an "optimal
reconstruction" $\hat{M}_\omega$ of $M_\omega$ based on $\mathcal{T}_\omega$ (e.g.,
using a maximum-likelihood procedure, maximum a
posteriori (MAP), or least-squares reconstruction) and
use that for matching; but that would merely amount
to picking $P(M_\omega|\mathcal{T}_\omega) = \delta(M_\omega, \hat{M}_\omega)$, which is almost
certainly not the correct distribution and would
in general result in worse performance than the
Bayes-optimal solution using the correct distribution
$P(M_\omega|\mathcal{T}_\omega)$; we will return to this issue below.

Therefore, the question of whether strongly view-based
recognition performs worse than a 3D recognition
system only makes sense if we give
both methods the same input data. In the case of
3D model-based recognition, we expect that the 3D
model-based observer should perform Bayes-optimal
reconstruction of the 3D models compatible with the
training views, resulting in a distribution $P(M_\omega|\mathcal{T}_\omega)$,
and then would use that distribution of models for
recognition, as described by Equation 1.

Note that we have, so far, not made any assumptions
about the representation of models or views; the
above expressions are true for collections of point features
as much as they are true for grayscale images.
However, it is common in the literature [TU] [5] [5] [5]
to examine the special case in which images are ordered
collections of $k$ points in $\mathbb{R}^2$, for some fixed
$k$, models are correspondingly ordered collections of
$k$ points in $\mathbb{R}^3$, noise $N$ has a Gaussian distribution
around each image point, and transformations consist
of 3D rotations followed by orthographic projection.
For this formalization of the 3D object recognition
problem, views are vectors in $\mathbb{R}^{2k}$ and models are
vectors in $\mathbb{R}^{3k}$. For concreteness and for a connection
with prior work, we use the same representation
when talking about a concrete instance of the recognition
problem. However, the derivations go through
for other kinds of representations and depend only on
the use of Euclidean distances of views represented as
vectors^3 in $\mathbb{R}^n$.
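
For the point-feature case just described, the generative model and the integral over the unobserved transformation can be made concrete with a small Monte Carlo sketch. The assumptions here are loud ones: rotations are drawn uniformly (via random unit quaternions), noise is isotropic Gaussian with standard deviation `sigma`, $P(R, N|M_\omega) = P(R)P(N)$, and all function names are hypothetical.

```python
import math
import random

def random_rotation(rng):
    """Draw a 3x3 rotation matrix from a random unit quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]

def project(model, rot):
    """Rotate k 3D points and drop depth (orthographic projection) -> vector in R^(2k)."""
    view = []
    for p in model:
        for row in rot[:2]:
            view.append(sum(r * c for r, c in zip(row, p)))
    return view

def sample_view(model, sigma, rng):
    """Sample V = R(M) + N with Gaussian noise of standard deviation sigma."""
    clean = project(model, random_rotation(rng))
    return [c + rng.gauss(0.0, sigma) for c in clean]

def likelihood(view, model, sigma, rng, n_samples=2000):
    """Monte Carlo estimate of P(V | M): average the noise density over sampled R."""
    dim = len(view)
    norm = (2.0 * math.pi * sigma * sigma) ** (-dim / 2.0)
    total = 0.0
    for _ in range(n_samples):
        clean = project(model, random_rotation(rng))
        sq = sum((a - b) ** 2 for a, b in zip(view, clean))
        total += norm * math.exp(-sq / (2.0 * sigma * sigma))
    return total / n_samples
```

The estimate is only as good as the number of sampled rotations; it is meant to show the structure of the computation, not an efficient way to evaluate it.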

3. View-Based Recognition 

Let us now show that strong view-approximation
methods can achieve Bayes-optimal 3D recognition
performance for feature-based object recognition. In
fact, for our construction, we assume that the fixed
similarity measure used by the strong view-based
approximation method is simply the Euclidean distance.
That is, we define the similarity function for
a view $V$ and a view $T$ as $S(V, T) = \|V - T\|$. Note
that in the case where $V$ and $T$ are concatenations
of the locations of feature points in the image and
the training view, this is the same as a pointwise
squared error evaluation, $\sqrt{\sum_i (v_i - t_i)^2}$, where the
$v_i, t_i \in \mathbb{R}^2$ are corresponding feature locations in the
two vectors.

3 Note in particular that choosing to represent views as vectors
in $\mathbb{R}^n$ does not imply knowledge of feature correspondences;
for example, even if a view $V \in \mathbb{R}^{2k}$ represents the 2D
coordinates of feature points in that view, they might simply be
ordered lexicographically. A representation of the input image
as a feature map or image also does not convey any feature
correspondence information.
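
The equivalence between the Euclidean distance on concatenated view vectors and the pointwise error over feature locations can be checked directly. The sketch below uses hypothetical function names and assumes the features are listed in the same order in both views.

```python
import math

def similarity(v, t):
    """S(V, T) = ||V - T|| for views given as flat vectors in R^(2k)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, t)))

def pointwise_error(v_points, t_points):
    """Root of the summed squared distances between corresponding 2D feature points."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(v_points, t_points)))

def flatten(points):
    """Concatenate k two-dimensional feature locations into one vector."""
    return [c for p in points for c in p]
```

Calling `similarity(flatten(a), flatten(b))` and `pointwise_error(a, b)` on the same two lists of feature points gives the same value up to floating-point rounding.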

When attempting view-based recognition, we are
comparing our unknown image $V$ against many previously
stored training views
$\mathcal{T} = \bigcup_\omega \mathcal{T}_\omega = \{T_{\omega,i} : \omega \in \Omega,\ i = 1, \ldots, r_\omega\} \subset \mathbb{R}^{2k}$.
We call this entire collection
$\mathcal{T}$ of training views and their associated object labels
the model base.

When attempting to recognize an object from one 
of its views V, a strongly two-dimensional view-based 
observer may take into account the real-valued similarity
of the view to each of the training views
$S(V, T_{\omega,i})$ and combine them in some way. The
strongly view-based observer is not permitted to eval- 
uate S for different transformations of the views, or 
to perform calculations involving the coordinates of 
the views, or perform any of the other operations 
that model-based or view-based recognition systems 
commonly perform (e.g., [2], [1]). 

The definition of a strongly two-dimensional ob- 
server stated informally by [5] and restated formally 
above does not impose any restrictions on the kinds 
of knowledge an observer has about the models in 
the model base, or the kinds of computations an ob- 
server may perform on those models. However, since 
we think of visual systems as operating on-line and 
acquiring models incrementally, we impose here the 
further restriction on the strongly two-dimensional 
observer that his entire knowledge about the objects
in the model base is limited to knowledge about their
pairwise similarities $S(T_{\omega,i}, T_{\omega',j})$. This strengthens
the result because it shows that an observer having
even less information than that required by the def- 
inition of the strongly two-dimensional view-based 
observer can still perform Bayes-optimal 3D recog- 
nition. 

Let us call this entire collection of similarity measurements
between the target view and the views
in the model base, together with the pairwise similarities
of views in the model base, $\mathcal{S}(V, \mathcal{T})$. A
Bayes-optimal observer will combine them in a Bayes-optimal
way. We will show the following theorem:

Theorem 1 Let $V$ and $T_{\omega,i}$ be object views represented
as vectors in $\mathbb{R}^{2k}$. The collection of Euclidean
similarity measurements $\mathcal{S}(V, \mathcal{T})$ against
almost any model base of size $N > 2k$ is sufficient
for performing Bayes-optimal 3D recognition.

To show this, we will show that an observer can reconstruct
$V$ and the $T_{\omega,i}$ given $\mathcal{S}(V, \mathcal{T})$, up to transformations
that do not affect classification. To establish
this, we use the following Lemma:

Lemma 1 For a collection of $N$ distinct vectors
$p_1, \ldots, p_N$ that span $\mathbb{R}^n$, if $N > n$, we can reconstruct
the coordinates of the vectors from the collection
of Euclidean distances $d_{ij} = \|p_i - p_j\|$ up to a
global translation, a global rotation, and mirror reversal.

Proof. See Appendix A. 
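
To make the statement of Lemma 1 tangible, here is a numerical sketch of one way to recover coordinates from pairwise distances. It is not the construction sketched in Appendix A; it swaps in a standard alternative: place the first point at the origin, build the Gram matrix of the remaining points from squared distances, and factor it with a Cholesky decomposition that skips zero pivots. The function name and the tolerance `eps` are assumptions.

```python
import math

def coords_from_distances(d, eps=1e-9):
    """Recover point coordinates, up to a rigid motion and reflection,
    from a full symmetric matrix d of pairwise Euclidean distances.

    Returns one coordinate list per point. The lists have length
    len(d) - 1, but only as many columns are nonzero as the dimension
    actually spanned by the points.
    """
    m = len(d) - 1
    # Gram matrix of the vectors p_i - p_0 (law of cosines).
    gram = [[(d[0][i + 1] ** 2 + d[0][j + 1] ** 2 - d[i + 1][j + 1] ** 2) / 2.0
             for j in range(m)] for i in range(m)]
    # Cholesky factorisation gram = low * low^T. A pivot that is
    # (numerically) zero means the point adds no new dimension, so its
    # column is left at zero.
    low = [[0.0] * m for _ in range(m)]
    for k in range(m):
        pivot = gram[k][k] - sum(low[k][c] ** 2 for c in range(k))
        if pivot <= eps:
            continue
        low[k][k] = math.sqrt(pivot)
        for r in range(k + 1, m):
            s = gram[r][k] - sum(low[r][c] * low[k][c] for c in range(k))
            low[r][k] = s / low[k][k]
    # Point 0 sits at the origin; row i of low is the position of point i + 1.
    return [[0.0] * m] + low
```

In the setting of Theorem 1, the target view can simply be appended as one more point of the distance matrix, so the same routine recovers it together with the model base. The recovered configuration reproduces all pairwise distances but generally differs from the original by the global transformation named in the lemma.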

Proof of Theorem 1. As defined above, the target
view $V$ and each of the $N$ training views $T_{\omega,i}$ is represented
as a point in $\mathbb{R}^{2k}$. We identify $n = 2k$.
Furthermore, we have the set of similarity measurements
$\mathcal{S}(V, \mathcal{T})$. We identify the similarity measurements
comparing only the $N$ views in the model base
with the $d_{ij}$ in the Lemma. Lemma 1 tells us that
if the model base contains more than $2k$ training views,
then we can reconstruct the model base and the target
view from those similarity measurements, up to
a single global transformation $G$ (translation, global
rotation, and mirror reversal), provided that the set
of training views spans $\mathbb{R}^{2k}$.

This will be true for almost all collections of $N$
training views, for the following reason. Consider
the concatenation of the $N$ training views into a vector
$p$ in the space of $N$ $n$-dimensional vectors, i.e.,
$\mathbb{R}^{N \cdot n}$. This collection of vectors can fail to satisfy
the requirements of Lemma 1 either by not spanning
$\mathbb{R}^n$ or by having two vectors be identical. Either of
these is easily seen to constrain $p$ to lie on a submanifold
of $\mathbb{R}^{N \cdot n}$ of measure zero. Since there is only a
finite number of those constraints, their union still has
measure zero.

If we have some procedure for inferring $G$, then the
proof is done at this point: we can compute $G^{-1}$, reconstruct
the target view $V$ and the set of training
views $\mathcal{T}$ exactly by first applying Lemma 1 and then
transforming with $G^{-1}$, and finally perform Bayes-optimal
3D recognition as defined by Equation 1.
This is a construction of a Bayes-optimal 3D recognition
procedure conforming to the requirements of
Definition 1.

For completeness, however, let us assume that $G$
cannot be determined but that object identity is invariant
under a global translation, rotation, and mirror
reversal transformation $G$ of both the target views
and all the training views. This means that, for all
target views $V$ and sets of training views $\mathcal{T}$, our decision
procedure $D$ is invariant under $G$:

$$D(V, \mathcal{T}) = D(GV, G\mathcal{T}) \qquad (2)$$

Here $G\mathcal{T} = \{GT \mid T \in \mathcal{T}\}$. If we apply Lemma 1, it
will reconstruct for us $GV$ and $G\mathcal{T}$ for some such (unknown)
transformation $G$. But since, by assumption,
$D(V, \mathcal{T}) = D(GV, G\mathcal{T})$, if we apply our regular decision
procedure to the transformed training and target
views, we will be making the same decisions as if we
had applied them to the original training and target
views. Since Bayes-optimal 3D recognition, as expressed
in Equation 1, is a decision procedure of this
form, it can be evaluated in this way and will yield
the same results on the target and training views reconstructed
from the similarity values as it does on
the original target and training views.

Hence, by first reconstructing the target and training
views using Lemma 1 and then applying Equation 1,
we have constructed a Bayes-optimal 3D
recognition procedure using only 2D similarity measurements
between target and training views, as required
by Definition 1. □
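
The invariance used in the second half of the proof can be illustrated with a toy case. The sketch assumes views that are points in $\mathbb{R}^2$ (that is, $k = 1$), a transformation $G$ made of an optional mirror reversal, a rotation, and a translation, and a nearest neighbor decision; all names are hypothetical. Because the decision depends only on distances, it is unchanged when $G$ is applied to the target and to every training view.

```python
import math

def transform(p, angle, shift, mirror=False):
    """Apply G: optional mirror reversal, then rotation by `angle`, then translation."""
    x, y = p
    if mirror:
        x = -x
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1])

def nearest_label(target, views):
    """Decision procedure that depends only on Euclidean distances."""
    return min(views, key=lambda lv: math.dist(target, lv[1]))[0]

def apply_to_base(views, angle, shift, mirror=False):
    """Apply the same G to every (label, view) pair in the model base."""
    return [(label, transform(v, angle, shift, mirror)) for label, v in views]
```

Note that $G$ acts on the space in which views are represented as vectors, not on the image plane; the two coincide here only because the toy views are single 2D points.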

Before continuing, we should note that the appearance
of the global transformation $G$ is simply an artifact
of the use of Euclidean distance as our similarity
measure, since Euclidean distances are invariant under
this set of transformations. If we pick a similarity
measure that is not invariant, the uncertainty about
$G$ disappears. Appendix B contains such a similarity
measure.

The reason for using Euclidean distance in these
derivations is that it is an intuitive similarity
measure for 2D views and, at the same time, makes
the proof of Lemma 1 fairly easy. The rotational
invariance, for example, can be eliminated by
choosing a slightly more complicated similarity function
$S(V, T) = \sqrt{\sum_i \frac{1}{i} (V_i - T_i)^2}$, but the analogous
proof for Lemma 1 becomes more complicated.

However, the appearance of $G$ is not a particularly
serious issue. If, in addition to the set of similarities,
we know the actual 2D coordinates of features
in $2k + 1$ training views (for example, from tactile
input), after applying Lemma 1 to obtain $GV$ and
$G\mathcal{T}$, we can use those to determine $G^{-1}$ and reconstruct
the target view $V$ and training views exactly.
Note that Definition 1 permits such information to
be available even to a strongly two-dimensional
view-based observer.

Another way of looking at this is that $G$ does not
affect how we measure translation and rotation of
different views relative to each other. That is, informally
stated, once we have decided that a certain
view represents, for example, "vertical", we can determine
the orientation of other views relative to that
view even if we don't know $G$. That situation is somewhat
analogous to phenomena observed in human vision,
which allow fairly rapid global reinterpretation
of globally transformed visual inputs; it is equivalent
to saying that $G$ remains unknown but that
our decision procedure is invariant under $G$, as in the
second part of the proof above.

Note on Model Acquisition. The reader should
recognize that the "reconstruction" of coordinates
from similarity measurements is a completely separate
computation from the acquisition of 3D models
from 2D views (e.g., [7]). The reconstruction above
is concerned with the recovery of $2k$-dimensional
vectors from internally computed similarity values
among $2k$-dimensional vectors. In 3D model acquisition
from 2D views, we attempt to combine views
of an object, possibly subject to sensor noise, into a
consistent model. 3D model acquisition could be carried
out after the coordinates of the individual views
of an object have been reconstructed from similarity
measurements using the above procedure.

Other Feature Vectors. The same construction 
as described above applies to many other feature 
types and situations, like grayscale or color images, 
feature locations without correspondences, etc. 

For example, if correspondences between feature
locations in the image and training views are not known,
we can still concatenate the $k$ two-dimensional coordinates
of those feature locations in each view into
a single vector in some arbitrary order and compute
similarity, as before, using Euclidean distances. The
resulting view similarity measure would not be particularly
nicely behaved, but it would still satisfy the
criteria of a strongly view-based observer. For recognition
using those similarity measures, the observer
would reconstruct the $2k$-dimensional vectors as before
and then would have to use some other method
to find correspondences between different views, just
as if the observer had been given the original visual
input instead of similarities.

Actual Implementations. While the proof of the
statistical sufficiency of $\mathcal{S}(V, \mathcal{T})$ has involved the
reconstruction of views from similarity measurements,
this is merely a mathematical device; it does not
mean that every Bayes-optimal view-based recognition
system actually has to carry out such a reconstruction.
Quite to the contrary, given a collection
of millions of stored training views $\mathcal{T}$, it seems quite
plausible that even very simple decision functions,
perhaps even something as simple as a linear discriminant
function on some fixed function $g$ of the
similarity values, $\Phi_\omega(V) = \sum_i \alpha_{\omega,i}\, g(S(V, T_{\omega,i}))$, may
already represent a close approximation to the
Bayes-optimal error rate and can be expected to converge to
the Bayes-optimal 3D recognition error rate for large
enough sets of training views. Note, in particular,
that Radial Basis Functions (RBFs) are of this form,
although they are not actually applied in exactly this
form in the most well-known applications of RBFs to
3D object recognition [S]. 
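
A linear discriminant on a fixed function of the similarity values can be sketched as follows. The Gaussian choice for $g$, the `width` parameter, the dictionary layout of the model base, and the hand-set weights are all assumptions for illustration; in practice the weights would have to be learned from data.

```python
import math

def gaussian(s, width=1.0):
    """Hypothetical fixed function g applied to a Euclidean distance s."""
    return math.exp(-(s / width) ** 2)

def discriminant(target, views, weights, g=gaussian):
    """Phi_w(V) = sum_i alpha_{w,i} * g(S(V, T_{w,i})) for one object w."""
    return sum(a * g(math.dist(target, t)) for a, t in zip(weights, views))

def classify(target, model_base, weights):
    """Pick the object whose discriminant value is largest.

    model_base maps each label to its list of training views;
    weights maps each label to one coefficient per training view.
    """
    return max(model_base, key=lambda w: discriminant(target, model_base[w], weights[w]))
```

The target view enters the computation only through the distances `math.dist(target, t)`, so this classifier satisfies the restriction of Definition 1.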



4. View-Based Model Acquisition

Given that we have seen that a strongly two- 
dimensional observer can, in fact, perform 3D ob- 
ject recognition as well as a Bayes-optimal 3D model- 
based observer, we might ask the question of whether 
perhaps view-based acquisition of new models re- 
quires more training in order to achieve a compara- 
ble level of performance as direct, coordinate-system 
based 3D model building and model-based recogni- 
tion. 

We have already answered that question implicitly
in our derivation of Bayes-optimal 3D recognition.
Bayes-optimal 3D recognition is carried out in terms
of (estimates of) $P(V|\mathcal{T}_\omega)$. It makes no difference
how a vision system internally computes $P(V|\mathcal{T}_\omega)$.
The computation may involve the construction of
explicit 3D object models, or it may be carried out
in some other way. The computation may be carried
out at the time when the training views are first encountered,
or it may be carried out when the vision
system is faced with the task of recognizing the object
represented by view $V$. All that matters is that
the estimate of $P(V|\mathcal{T}_\omega)$ ultimately is a good approximation
to the true value.

Since we have shown in the previous section that
a strongly two-dimensional observer can reconstruct
the target and training views perfectly from a set of
real-valued similarity measurements, if that observer
chooses to evaluate $P(V|\mathcal{T}_\omega)$ by building a 3D model
$M_\omega$ from training views internally (using techniques
like, e.g., [7]), the observer can simply do this in terms
of views reconstructed from the similarity measurements.

5. 3D Model-Based Recognition 

In the previous sections, we have seen that strongly
view-based observers can perform Bayes-optimal 3D
object recognition. We also showed that strongly
view-based observers can perform model acquisition
as well as any 3D model-based recognition system.
In both cases, the reason was that the set of similarity
measurements $\mathcal{S}(V, \mathcal{T})$ is essentially equivalent to
complete knowledge of all the training views and the
target view.

Note that there is a distinction between Bayes- 
optimal 3D recognition and 3D model-based recog- 
nition. Bayes-optimal 3D recognition is simply any 
procedure that achieves Bayes error rates on a 3D 
recognition problem, regardless of what mechanisms 
it uses internally. 3D model-based recognition (at 
least in the sense used in this paper) is based specif- 
ically on object-centered shape models. 

Model-based 3D object recognition has been argued
for in human vision by Marr [S], but work on
3D feature-based object recognition also usually
assumes the existence of a 3D model (e.g., [2]).
Such models are usually assumed to be either given
(for example, from a computer-aided design (CAD)
model of the object), or reconstructed from image
data (e.g., [ZHi]).

3D model-based recognition from collections of 2D
training views divides visual object recognition into
two steps. First, an object-centered 3D shape model
$M_\omega$ is constructed based on the training views $\mathcal{T}_\omega$.
Then, that 3D shape model is used to find a match.

In its strictest form, this object-centered shape
model is a maximum likelihood reconstruction or
maximum a posteriori (MAP) reconstruction $\hat{M}_\omega(\mathcal{T}_\omega)$
of the feature locations in 3D from the set of training
views $\mathcal{T}_\omega$. $\hat{M}_\omega(\mathcal{T}_\omega)$ is then used for performing recognition.
If we assume that the 3D model match against
the image is carried out in a Bayes-optimal way, this
means that we use

$$P(V|\omega) = \int P(V|\hat{M}_\omega(\mathcal{T}_\omega), R, N)\, P(R, N)\, dN\, dR \qquad (3)$$

By comparing Equation 3 against Equation 1, we
see that this amounts to assuming that
$P(M|\mathcal{T}_\omega) = \delta(M, \hat{M}_\omega(\mathcal{T}_\omega))$. This is correct (and Bayes-optimal)
when the object model is known exactly a priori. But
when the object model has to be reconstructed from
training data, then, in general, $P(M|\mathcal{T}_\omega)$ is not going
to be a $\delta$ function. The use of a maximum likelihood
or maximum a posteriori estimate for the model has
to be justified as an approximation; it is probably a
good approximation when many training views are
available and/or the amount of noise is fairly small.

Therefore, model-based recognition using the 
"best" (in a maximum likelihood sense) 3D model 



corresponding to the training views does not neces- 
sarily lead to a Bayes-optimal 3D object recognition 
system. To achieve Bayes-optimality, in general, it 
is necessary to model the distribution $P(M|\mathcal{T}_\omega)$ correctly.
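
The gap between a plug-in estimate and the full posterior can be seen in a toy scalar analogue. This is an assumption made purely for illustration and is not the 3D setting of this paper: the "model" is a single number $M$ with a flat prior, and each of the $r$ training "views" is $T_i = M + N_i$ with $N_i \sim \mathcal{N}(0, \sigma^2)$.

```latex
% Toy scalar analogue (illustrative assumption, not the paper's 3D setting).
\begin{align*}
P(M \mid \mathcal{T}) &= \mathcal{N}\!\left(M;\ \bar{T},\ \sigma^2 / r\right),
  \qquad \bar{T} = \frac{1}{r} \sum_{i=1}^{r} T_i \\
P(V \mid \mathcal{T}) &= \int \mathcal{N}(V;\ M,\ \sigma^2)\, P(M \mid \mathcal{T})\, dM
  = \mathcal{N}\!\left(V;\ \bar{T},\ \sigma^2 (1 + 1/r)\right) \\
P(V \mid \hat{M}) &= \mathcal{N}\!\left(V;\ \bar{T},\ \sigma^2\right)
  \qquad \text{(plug-in estimate, } \hat{M} = \bar{T}\text{)}
\end{align*}
```

In this toy case the plug-in density is too narrow by a factor of $1 + 1/r$ in variance. The discrepancy vanishes as $r$ grows, which matches the remark that the point estimate is a good approximation when many training views are available.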

We can attempt to address this problem by adopting
statistical 3D shape models. For example, we can
associate each feature point in the maximum likelihood
or MAP reconstruction with error bounds or
a Gaussian distribution. This, then, gives rise to a
probability distribution over possible 3D models compatible
with the training views. However, this, too,
only represents an approximation to the true distribution
$P(M|\mathcal{T}_\omega)$ because errors in the reconstruction
of 3D feature locations can be (and usually are) correlated.

Overall, we see that using an object-centered 3D
shape model in 3D model-based recognition, possibly
with an associated error model, is simply a particular
choice of representation for $P(M_\omega|\mathcal{T}_\omega)$. But we have
seen that such uses of 3D models in recognition correspond
to specific assumptions about $P(M_\omega|\mathcal{T}_\omega)$, assumptions
that may not be satisfied in specific recognition
problems. Or, to put it more succinctly, combining
optimal 3D model reconstruction from training views
with optimal 3D model matching against 2D images
does not necessarily result in Bayes-optimal 3D recognition.

6. Discussion 

A key result of this paper is that a strongly two- 
dimensional observer, that is, an observer that per- 
forms object recognition only in terms of Euclidean 
similarity measures between different views, can 
achieve the same Bayes-optimal performance as an 
observer having full knowledge of all the geometric 
information contained within views. The reason was 
that the strongly two-dimensional observer has all the 
information necessary to reconstruct the essential ge- 
ometric information contained in the views: strongly 
view-based recognition is really nothing more than a 
change of coordinate system in which visual input is 
represented. And while we used the concrete example
of objects consisting of point-like features, as used
in prior work in the literature, the same approach
works for many other forms of view representations,
for example, in terms of locations without known
correspondences or gray-value pixels.

As a consequence, it is impossible to distin- 
guish definitively 3D model-based recognition from 
strongly view-based recognition by comparing the er- 
ror rates of different observers: both 3D model-based 
observers and view-based observers can achieve the 
same Bayes-optimal 3D recognition and model acquisition
performance; either of them may fall short
if the observer is using a suboptimal implementation.

These results seem to be in contradiction to those 
claimed in [6] [5]. In those papers, the authors define
"ideal" 2D observers and demonstrate that human
performance and 3D model-based recognition
exceed that of those ideal observers. However, while
those papers compare human performance to some 
2D observers (and, in fact, observers that are Bayes- 
optimal for certain 2D matching problems [1]), the 
2D observers in those papers simply are not the best 
possible that can be constructed with 2D similarity 
methods and arbitrary combination of evidence pro- 
cedures. 

Whether any meaningful and testable hypotheses 
distinguishing view-based and 3D model-based recog- 
nition systems and strategies can be formulated at 
all remains to be seen. It might be useful to shift the 
debate from considerations of what operations are in- 
volved in the recognition of individual objects to the 
prior knowledge about the world that a 3D model- 
based system is created with. A Bayes-optimal 3D 
model-based system should be able to perform per- 
fect view generalization without any training, while 
a more general-purpose visual recognition system 
would require time to learn the view generalization 
function. On the other hand, a Bayes-optimal 3D 
model-based system might not be able to adapt well 
to objects whose appearance transforms in ways other 
than that expected of 3D models under changes in 
viewing position [?]. However, experimentation in 
these areas is difficult because "training" refers to the 
entire visual experience of a human observer through- 
out his life, not to the acquisition of individual object 
models. 

In fact, the considerations in the last section have 
shown that 3D model-based recognition systems that 



either just perform a maximum likelihood or MAP 
reconstruction of a 3D model from training views, 
or even systems that associate error bounds with 
such reconstructions, are not Bayes optimal for 3D 
recognition in general. Bayes-optimal recognition in
general requires correct modeling of the distribution
$P(M_\omega|\mathcal{T}_\omega)$, and approximating that distribution well
under the constraint that it be represented in terms
of perturbations of a concrete 3D shape model may
be very difficult and, in any case, is not usually attempted
by 3D model-based recognition systems.
View-based models, instead, attempt to model
$P(V|\omega)$ or $P(V|\mathcal{T}_\omega)$ directly without imposing the
constraint that the representation of that density be
tied somehow to a 3D shape model. Whether this is
actually easier or more successful in practice remains 
to be seen, but it is certainly a valid alternative to 
3D shape models, and it allows us to explore a much 
larger space of possible probabilistic models. 

The reconstruction methods used in this paper are 
a mathematical device to establish statistical sufficiency.
While reconstruction from distances could
probably be accomplished by simple constraint propagation
in hardware that might plausibly be described as
"neural", this is entirely unnecessary. Any classification
method that achieves Bayes-optimal asymptotic
performance given enough training data would be expected
eventually to learn the view generalization function,
whether it is expressed in terms of Euclidean
distances to prototype views or in terms of coor- 
dinates. The coordinate transformation implied by 
view-based representations, using distances to proto- 
type views, does not seem particularly complex and 
might even simplify the learning problem for class 
conditional densities or view generalization functions. 
Therefore, we should not judge the plausibility of 
Bayes-optimal view-based recognition in an actual 
vision system by the mathematical techniques used 
in this paper for establishing statistical sufficiency. 
The question of whether we can construct Bayes- 
optimal view generalization functions that are based 
on strongly two-dimensional techniques is a question 
of complexity, as well as the distribution of actual 
shapes and views in the real world, and will be addressed
in a separate paper.

Here is a brief sketch of the proof of Lemma 1.

Lemma 2 The intersection of two hyperspheres
$A = \{x \in \mathbb{R}^n \mid (x - a)^2 = r_a^2\}$ and
$B = \{x \in \mathbb{R}^n \mid (x - b)^2 = r_b^2\}$
of dimension $n-1$ is either empty, a single point,
an $(n-2)$-dimensional hypersphere contained in an
$(n-1)$-dimensional linear subspace perpendicular to $(b - a)$,
or $A = B$.

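Assume we are given $A$ and $B$. If $a = b$ and $r_a = r_b$,
then $A = B$. If $a = b$ and $r_a \neq r_b$, then the
intersection is empty. Therefore, let us assume that
$a \neq b$ and that there is a common point $p \in A \cap B$.
Without loss of generality, place $a$ at the origin, $a = 0$.
Write $p = \lambda(b - a) + q = \lambda b + q$, where $b \cdot q = 0$.
Plugging this into the equations for $A$ and $B$, we
obtain $\lambda^2 \|b\|^2 + \|q\|^2 = r_a^2$ and
$(1 - \lambda)^2 \|b\|^2 + \|q\|^2 = r_b^2$. Solving
for $\lambda$ yields $\lambda = (r_a^2 - r_b^2 + \|b\|^2) / (2\|b\|^2)$
and $\|q\| = \sqrt{r_a^2 - \lambda^2 \|b\|^2}$. Both values are the
same for every common point, so all common points lie in the
hyperplane through $\lambda b$ perpendicular to $b - a$, at
distance $\|q\|$ from $\lambda b$, which establishes the claim. □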
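
The computation in this proof can be checked numerically. The following is a minimal sketch, not part of the original paper: the function name `sphere_intersection` and its return convention are assumptions made for illustration. It uses the general form $\lambda = (r_a^2 - r_b^2 + \|b - a\|^2) / (2\|b - a\|^2)$ and $\|q\|^2 = r_a^2 - \lambda^2 \|b - a\|^2$, without first translating $a$ to the origin.

```python
import numpy as np

def sphere_intersection(a, r_a, b, r_b):
    """Intersect two hyperspheres with distinct centers a and b.

    Returns None if the spheres do not meet; otherwise returns
    (center, radius, normal) describing the lower-dimensional sphere
    of common points: it lies in the hyperplane through `center`
    with unit normal `normal` (parallel to b - a).
    Hypothetical helper, written only to illustrate Lemma 2.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = b - a
    dist2 = float(d @ d)
    if dist2 == 0.0:
        # a = b: the intersection is empty or the whole sphere (A = B).
        raise ValueError("centers coincide")
    # Position of the common hyperplane along the line from a to b.
    lam = (r_a**2 - r_b**2 + dist2) / (2.0 * dist2)
    # Squared radius of the intersection sphere inside that hyperplane.
    q2 = r_a**2 - lam**2 * dist2
    if q2 < 0.0:
        return None
    return a + lam * d, np.sqrt(q2), d / np.sqrt(dist2)
```

For two spheres of radius 5 centered at (0, 0, 0) and (6, 0, 0), this gives the circle of radius 4 centered at (3, 0, 0) in the plane perpendicular to the first axis. A radius of zero corresponds to the single-point case of the lemma.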

Lemma 3 For a collection of $n$ linearly independent
vectors $p_1, \ldots, p_n$ in $\mathbb{R}^n$, we can reconstruct the
coordinates of the vectors from the collection of Euclidean
distances $d_{ij} = \|p_i - p_j\|$ up to a global translation, a
global rotation, and mirror image reversal.

If $n = 1$, we have a single point, which we place
at the origin, giving us a solution up to translation.
Now, take the distances $d_{ij}$ for $i, j \le n - 1$
and apply the lemma, giving a collection of points
$p_1, \ldots, p_{n-1} \in \mathbb{R}^{n-1}$. Map that solution into $\mathbb{R}^n$ by
adding 0 as the last coordinate to each vector; this
corresponds to an arbitrary choice of rotation. Now
consider the hyperspheres around each point $p_i$ with
radius $d_{ni}$. By Lemma 2, their intersection will be a
linear subspace of dimension 1, containing a hypersphere
of dimension 0, i.e., two points. It is left to
the reader to prove that these are mirror symmetric
around the plane $\{v \in \mathbb{R}^n \mid v_n = 0\}$. □

Lemma 1 For a collection of $N$ distinct vectors
$p_1, \ldots, p_N$ that span $\mathbb{R}^n$, if $N > n$, we can reconstruct
the coordinates of the vectors from the collection of
Euclidean distances $d_{ij} = \|p_i - p_j\|$ up to a global
translation, a global rotation, and mirror reversal.

Find a linearly independent subset of $n$ vectors and
apply Lemma 3, giving $p_1, \ldots, p_n$. Now, consider the
reconstruction of $p_q = p_{n+1}, \ldots, p_N$. Place spheres of
radius $d_{qi}$ around each $p_i$, $i = 1, \ldots, n$, and compute
$p_q$ as the intersection of the linear subspaces from
Lemma 2. The reader can verify that this intersection
has to be unique. □
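
Reconstruction of coordinates from pairwise Euclidean distances can also be tried out numerically. The sketch below does not implement the inductive sphere-intersection argument of the lemmas; it swaps in classical multidimensional scaling (double centering of the squared distance matrix followed by an eigendecomposition), which recovers a point configuration up to the same ambiguities: translation, rotation, and mirror reversal. The function names are assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(P):
    """Matrix of Euclidean distances between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def reconstruct_from_distances(D, dim):
    """Classical MDS: recover N points in R^dim from their distance matrix.

    The result is centered at the origin and determined only up to
    rotation and mirror reversal. This is a stand-in technique, not
    the construction used in the proofs above.
    """
    D = np.asarray(D, dtype=float)
    N = D.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    G = -0.5 * J @ (D ** 2) @ J              # Gram matrix of centered points
    w, V = np.linalg.eigh(G)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:dim]          # keep the `dim` largest
    w_top = np.clip(w[idx], 0.0, None)       # guard against tiny negatives
    return V[:, idx] * np.sqrt(w_top)

# Usage: six random points in R^3 are recovered up to a rigid motion,
# so their pairwise distances are reproduced.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 3))
Q = reconstruct_from_distances(pairwise_distances(P), dim=3)
print(np.allclose(pairwise_distances(Q), pairwise_distances(P)))
```

The coordinates in `Q` generally differ from those in `P`; only the distances agree, which is exactly the sense in which the lemmas speak of reconstruction.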



$\mu(S, T_i)$ and the combinations of the similarity
scores. Note that in this construction $f$ is not even
object-dependent. □

While the function $\iota$ used in this construction happens
not to be continuous, a construction using a
Hilbert curve (a space-filling curve) for $\iota$ would allow
us to derive essentially the same result.
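
The digit-interleaving map on which the construction of Appendix B rests can be illustrated on finite decimal strings. The sketch below is a simplified illustration, not the paper's definition. It assumes non-negative numbers written with the same fixed number of digits, and it counts digit positions from the left rather than by powers of ten. The names `interleave` and `deinterleave` are hypothetical.

```python
def interleave(digit_strings):
    """Merge k equal-length digit strings into one by interleaving digits.

    Position j of the result holds digit (j div k) of argument (j mod k).
    """
    k = len(digit_strings)
    n = len(digit_strings[0])
    if any(len(s) != n for s in digit_strings):
        raise ValueError("all arguments must have the same number of digits")
    return "".join(digit_strings[j % k][j // k] for j in range(n * k))

def deinterleave(s, k):
    """Inverse of interleave: split one digit string back into k strings."""
    return [s[i::k] for i in range(k)]

# Usage: a "similarity" value that encodes both views completely.
S = ["0012", "0340"]          # coordinates of a target view (assumed toy data)
T = ["9876", "0001"]          # coordinates of a stored view (assumed toy data)
code = interleave(S + T)      # plays the role of the single real number
recovered = deinterleave(code, len(S) + len(T))
print(recovered == S + T)
```

Because the encoded value determines both views exactly, any function of the views can be rewritten as a function of such values. This is the sense in which the combination-of-evidence function loses no information.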



Appendix B 

In this Appendix, we construct a similarity function $\mu$
that permits exact reconstruction of $V$ and $T_{\omega,i}$ given
only the values of $\mu(V, T_{\omega,j})$ for a single $\omega$. This is
an alternative construction to that given in the text,
which potentially required knowledge of the similarity
of a target view to the training views for multiple
objects $\omega$ and reconstructed views only up to a global
translation, rotation, and mirror image.

Theorem 2 There exists a real-valued function
$\mu : \mathbb{R}^{2k} \times \mathbb{R}^{2k} \to \mathbb{R}$ and a function
$f : \mathbb{R}^r \to \mathbb{R}$ such that

$P(V \mid T_1, \ldots, T_r) = f(\mu(V, T_1), \ldots, \mu(V, T_r))$.

Here, $\mu$ is the "view similarity function" and $f$ is
the "combination of evidence function".

For the proof of this theorem, we require a family
of functions (one for each value of $k$) $\iota : \mathbb{R}^k \to \mathbb{R}$ and
its inverse $\iota^{-1} : \mathbb{R} \to \mathbb{R}^k$ such that $\iota^{-1}(\iota(x)) = x$ for
any $x$ in $\mathbb{R}^k$. We can construct a function $\iota$ easily
by interleaving the digits of the individual arguments.
That is, let $x_i = \sum_{j=-\infty}^{\infty} d_{j,i} \, 10^j$. Then,

$\iota(x) = \sum_{j=-\infty}^{\infty} d_{j \,\mathrm{div}\, k,\; j \bmod k} \, 10^j$. If $x' = \iota(x) =$

Now, let $v = (S, T_i)$ be the concatenation
of the vectors $S$ and $T_i$ and let $v_S$ and $v_T$
denote the portions of the vector $v$ corresponding
to $S$ and $T$ respectively in such a
concatenation. Choose $\mu(S, T_i) = \iota((S, T_i))$
and choose $f(\mu_1, \ldots, \mu_r) = P(S \mid T_1, \ldots, T_r) =
P(\iota^{-1}(\mu_1)_S \mid \iota^{-1}(\mu_1)_T, \ldots, \iota^{-1}(\mu_r)_T)$. By
construction, $f(\mu(S, T_1), \ldots, \mu(S, T_r)) =
P(S \mid T_1, \ldots, T_r)$. We have therefore shown that
any Bayes-optimal decision function based on 3D
models can be expressed as a decision function
involving only real-valued similarity functions